Exploration of the Food Standards Agency's Data for Glasgow Establishments. The data for this analysis was downloaded using the code in ./data/data_download.ipynb
"pd" is the common short name given to the pandas library
import pandas as pd
For a deeper look at getting started with pandas, there is a 2017 SciPy talk here
For a quick overview, I recommend the Cheat Sheet
The read_csv function can also handle most types of flat files (i.e. anything you can read with your favourite notepad application) provided that the correct parameters are passed. The method handles skipping rows, parsing dates from different columns, different encodings, zipped sources and more.
For the full documentation: read_csv docs
df = pd.read_csv('./data/food_safety_Glasgow.csv')
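The Glasgow file loads fine with the defaults, but for messier files the extra parameters mentioned above do the heavy lifting. A minimal sketch (the file contents below are invented purely to illustrate the parameters; they are not part of the FSA data):

```python
import io
import pandas as pd

# A hypothetical messy flat file: a banner row to skip, a ';' separator,
# and a date column to parse.
raw = io.StringIO(
    "exported by inspection-tool v2\n"
    "BusinessName;RatingDate\n"
    "Cafe One;2018-03-01\n"
    "Cafe Two;2018-04-15\n"
)
messy = pd.read_csv(
    raw,
    sep=';',                     # non-comma separator
    skiprows=1,                  # drop the banner line
    parse_dates=['RatingDate'],  # parse strings into datetime64
    encoding='utf-8',            # explicit encoding
)
```

For zipped sources you would point read_csv at the .zip/.gz file directly (or pass compression=...) and pandas decompresses on the fly.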
There are many ways to explore the data. Some of the simplest are:
df # showing the "full" dataset
df.shape # checking the size of the dataset
df.head(2) # looking at first 2 rows. For last rows use df.tail()
df['Latitude'] # look for a specific column
df[mymask] # filter rows with a boolean mask (mymask would be a boolean Series)
df.loc[639] # grabbing a sample row (using index label)
df.iloc[639] # grabbing a sample row (using integer-location index)
df.columns # looking at all column names
df.info() # summary non-nulls and types for each column
df.describe() # basic statistics (only for the numeric columns)
df.describe().loc['25%', 'Latitude'] # go query-crazy
df.info()
df.describe().loc['25%', 'Latitude']
df[['BusinessName','Latitude']]
Where did all the other columns go? They are non-numeric :(
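The df[mymask] line in the list above assumed a mask already existed. A mask is just a boolean Series aligned on the index; a self-contained sketch with a toy frame (the real columns come from the FSA CSV):

```python
import pandas as pd

# Toy frame standing in for df
toy = pd.DataFrame({
    'BusinessName': ['A', 'B', 'C'],
    'Latitude': [55.86, None, 55.84],
})

# Comparisons on a column produce a boolean Series
# (NaN compares as False, so missing rows are dropped too)
mymask = toy['Latitude'] > 55.85

# Indexing with the mask keeps only the True rows
subset = toy[mymask]
```

Masks can be combined with `&`, `|` and `~` for more involved row selections.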
pandas uses matplotlib under the hood to produce useful plots with one or a few lines of code, leaving the user the possibility of dropping down to pure matplotlib for more customised output.
For more info on pandas visualization, go here.
# visualize matplotlib plots inline in the notebook (only needs to run once per notebook... usually with the imports)
%matplotlib inline
df.plot('Longitude', 'Latitude', kind='scatter', grid=True)
df[df.Longitude > 0] # suspicious positive longitudes (Glasgow is west of Greenwich)
# From a quick google maps search: 61 Bridge St, Glasgow G5 9JB => 55.853033, -4.258319
df.loc[93, 'Longitude'] = -4.258319
df.loc[93, 'Latitude'] = 55.853033
# or.. df.loc[93, ['Longitude','Latitude']] = -4.258319, 55.853033
# showing the cleaned data
df.plot('Longitude', 'Latitude', kind='scatter', grid=True)
If it makes sense, you can:
df['Latitude'].fillna(999) # replace missing values with a sentinel
df['Latitude'].fillna(method='ffill') # forward-fill from the previous row
df['Latitude'].dropna() # drop missing values (all three return new Series)
is_null = pd.isnull(df.Latitude)
is_null.value_counts()
df = df.dropna(subset=['Latitude'])
df.shape
import folium
from folium.plugins import HeatMap
m = folium.Map(
    location=[55.863823, -4.267681],
    zoom_start=13,
    tiles='cartodbpositron',
)
rating_color = {
    'fhis_awaiting_inspection_en-GB': 'gray',
    'fhis_improvement_required_en-GB': 'red',
    'fhis_pass_en-GB': 'yellow',
    'fhis_pass_and_eat_safe_en-GB': 'green',
}
for _, row in df.iterrows():
    clean_businessname = row.BusinessName.replace("`", "'")
    info_txt = "<b>{}</b><br>(<i>{}</i>, {})".format(clean_businessname, row.RatingValue, row.RatingDate)
    folium.Circle(
        radius=10,
        location=[row.Latitude, row.Longitude],
        tooltip=info_txt,
        color='black',
        weight=0.5,
        fill_opacity=0.8,
        fill_color=rating_color[row.RatingKey],
        fill=True,
    ).add_to(m)
m
m = folium.Map(
    location=[55.863823, -4.267681],
    zoom_start=12,
    tiles='cartodbpositron',
)
data_needing_improvements = df.loc[df.RatingValue == 'Improvement Required', ['Latitude', 'Longitude']].values
data_pass_plus = df.loc[df.RatingValue == 'Pass and Eat Safe', ['Latitude', 'Longitude']].values
HeatMap(data=data_needing_improvements, name='need_improvements', show=False).add_to(m)
HeatMap(data=data_pass_plus, name='pass_plus', show=False).add_to(m)
folium.LayerControl().add_to(m)
m
I've downloaded Yelp data and matched businesses based on location and name, saving only the matches with good enough confidence to the yelp_matched.csv file. The data has been augmented with the FHRSID of the corresponding matches.
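The matching code itself isn't included in this notebook. A hypothetical sketch of how a name-plus-location confidence score could be computed (the function name, thresholds and scoring are my own invention, not the actual pipeline):

```python
from difflib import SequenceMatcher

def match_confidence(fsa_name, yelp_name, fsa_latlon, yelp_latlon):
    """Combine fuzzy name similarity with a crude location gate."""
    name_score = SequenceMatcher(None, fsa_name.lower(), yelp_name.lower()).ratio()
    # Euclidean distance in degrees; good enough for a within-city check
    dist_deg = ((fsa_latlon[0] - yelp_latlon[0]) ** 2 +
                (fsa_latlon[1] - yelp_latlon[1]) ** 2) ** 0.5
    close_enough = dist_deg < 0.002  # roughly a couple hundred metres
    return name_score if close_enough else 0.0

# Nearby pair with similar names gets a positive score
score = match_confidence('The Bay Tree', 'Bay Tree Cafe',
                         (55.8721, -4.2885), (55.8722, -4.2886))
```

Keeping only matches above some score threshold would then produce a file like yelp_matched.csv.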
df_yelp_matched = pd.read_csv('data/yelp_matched.csv')
df_yelp_matched.head()
# most-reviewed businesses on Yelp
df_yelp_matched.sort_values('review_count', ascending=False).head(3)
# highest rated businesses
df_yelp_matched \
    .sort_values(['rating', 'review_count'], ascending=False) \
    .head()
df_combined = pd.merge(left=df, right=df_yelp_matched, on='FHRSID', how='inner')
df_combined.head()
df_combined.iloc[527]
Let's create a smaller dataframe with only the columns we are interested in.
qdf = df_combined[['rating', 'review_count']].copy()
import numpy as np

# converting price from symbols to number ('£££' to 3)
qdf['price'] = df_combined['price'] \
    .fillna('') \
    .apply(len) \
    .replace(0, np.nan)  # pd.np is deprecated; import numpy directly
fsa_rating_order = ['Improvement Required', 'Awaiting Inspection', 'Pass', 'Pass and Eat Safe']
qdf['fsa_rating'] = df_combined['RatingValue'].apply(fsa_rating_order.index)
qdf.head()
# let's ignore restaurants with few reviews
has_enough_reviews = qdf['review_count'] > 3
compare = qdf[has_enough_reviews].groupby('fsa_rating')['rating'].agg(('mean', 'std'))
compare.index = fsa_rating_order
compare
compare.plot.bar(y='mean', yerr='std')
mm... doesn't look like it
compare = qdf[has_enough_reviews] \
    .dropna(subset=['price']) \
    .groupby('price')['rating'] \
    .agg(('mean', 'std', 'count'))  # a tuple keeps the column order deterministic (a set would not)
compare.index = ['£'*(i+1) for i in range(4)]
compare['std_err'] = compare['std'] / compare['count'] ** 0.5  # standard error = std / sqrt(n)
compare.plot.bar(y='mean', yerr='std_err')
compare
Why are the error bars so much bigger for the high-price bins? That's probably down to sample size (there are fewer pricey restaurants than cheap ones)
qdf[has_enough_reviews] \
    .dropna(subset=['price']) \
    .groupby(['price', 'rating']) \
    .count() \
    .unstack(1) \
    .fillna(0) \
    .rename({i: '£'*i for i in range(1, 5)}) \
    ['review_count'].plot(kind='bar', stacked=True, grid=True, cmap='jet')
..questions?